Properties of contact matrices induced by pairwise interactions in proteins
Properties that contact matrices (C-matrices) must have for native proteins to be the lowest energy conformations are considered in relation to a contact energy matrix (E-matrix), under the assumption that the total conformational energy of a protein can be approximated by a sum of pairwise interaction energies, each represented as the product of corresponding elements of these two matrices: a conformation-dependent function and a sequence-dependent energy parameter. Such pairwise interactions force native C-matrices into a relationship in which the interactions act as a Go-like potential [2] for the native C-matrix, because the lowest bound of the total energy function is equal to the total energy of the native conformation interacting through a Go-like pairwise potential. This relationship between the C- and E-matrices corresponds to 1) a parallel relationship between the eigenvectors of the C-matrix and those of the E-matrix, with a linear relationship between their eigenvalues, and 2) a parallel relationship between a contact number vector and the principal eigenvectors of the C-matrix and of the E-matrix, where the E-matrix is expanded in a series of eigenspaces with an additional constant term. The lowest bound of the total energy function indicates that this additional constant term in the spectral expansion of the E-matrix corresponds to a threshold of contact energy that approximately separates native contacts from non-native contacts. Inner products between the principal eigenvector of the C-matrix, that of the E-matrix, and a contact number vector have been examined for 182 proteins, each a representative of a different family of the SCOP database, and the results indicate parallel tendencies between those vectors. A statistical contact potential [4, 5] estimated from protein crystal structures was used to evaluate pairwise residue-residue interactions in proteins. In addition, the spectral representation of the C- and E-matrices reveals that pairwise residue-residue interactions, which depend only on the types of interacting amino acids and not on other residues in a protein, are insufficient, and that other interactions including residue connectivities and steric hindrance are needed to make native structures the unique lowest energy conformations.

PACS numbers: 87.15.Cc, 87.14.et, 87.15.ad, 87.15.-v



I. INTRODUCTION 

Predicting the three-dimensional structure of a protein from its sequence is equivalent to reproducing a three-dimensional structure from the one-dimensional information encoded in its sequence. From this viewpoint, many studies have tried to reconstruct three-dimensional structures from one-dimensional information such as contact numbers and the principal eigenvector of a contact matrix [6, 7, 8, 9]. An important question is not only what kind of one-dimensional information is needed to reconstruct protein structures, but also why such information is critical for the reconstruction.

Consider a distance matrix, each element of which is equal to the distance between the atoms or residues specified by its row and column. The information contained in the distance matrix is equivalent to the specification of the three-dimensional coordinates of each atom/residue, except that a mirror image of the native structure cannot be excluded from distance information alone. Reconstructing a distance matrix from one-dimensional vectors requires in principle the specification of all eigenvectors as well as eigenvalues. In other words, for an $N \times N$ matrix, $N$ $N$-dimensional vectors are required. However, the particular characteristics of proteins may allow the reconstruction of a distance matrix from fewer one-dimensional vectors.

A contact matrix, whose elements are equal to one for contacting atom/residue pairs and zero for non-contacting pairs, or more generally take a value between one and zero representing the degree of contact, is a simplification of a distance matrix into two categories, contact or non-contact, for the distance of each atom/residue pair. It nevertheless keeps almost all the information needed to reconstruct the three-dimensional structures of proteins. For a residue-level contact matrix consisting of the discrete values one and zero, Porto et al. [7] showed that the contact map of the native structure of globular proteins can be reconstructed starting from the sole knowledge of the contact map's principal eigenvector, and that the reconstructed contact map allows in turn for the accurate reconstruction of the three-dimensional structure.

A vector of contact numbers, defined as the number of atoms or residues in contact with each atom/residue in a protein, is another type of one-dimensional vector that is often used as a one-dimensional representation of protein structures [10, 11, 12]; it may be similar to, but is not the same as, the principal eigenvector of a contact matrix. Kabakçıoğlu et al. [8] suggested that the number of feasible protein conformations that satisfy the constraint of a contact number for each residue is very limited.

The question is why the principal eigenvector of a contact matrix and a contact number vector contain significant information about protein structures. Here, we consider what properties of contact matrices are induced by pairwise contact interactions for native proteins to be the lowest energy conformations. For simplicity, the total conformational energy is assumed to consist of pairwise interactions over all atom or residue pairs. It is further assumed that each pairwise interaction can be expressed as a product of a conformation-dependent (C-dependent) factor and a sequence-dependent (S-dependent) factor; the S-dependent factor corresponds to an energy parameter specific to a given pair of atoms/residues. We call the matrix of C-dependent factors a generalized contact matrix, or simply a contact matrix (C-matrix), and the matrix of S-dependent factors a generalized contact energy matrix, or simply a contact energy matrix (E-matrix). Simple linear algebra indicates that such a total energy function is bounded from below by the total energy for a C-matrix in which all pairs with contact energies lower than a certain threshold are in contact. This lower bound is achieved if and only if proteins are ideal in the sense of having a so-called Go-like potential [2]. The Go-like potential is defined as one in which interaction energies between native contacts are always lower than those between non-native contacts. Real pairwise interactions in proteins cannot be a Go-like potential. In other words, real proteins cannot achieve this lowest bound of a pairwise potential, because of atom/residue connectivities and steric hindrance, which are not included in this type of total energy function. How, then, can they approach the lowest bound as closely as possible? The lowest bound can be approached by making the singular vectors of the C-matrix parallel to the corresponding singular vectors of the E-matrix, where correspondence is by the rank order of the singular values. Also, at the lowest bound a contact number vector tends to be parallel to the principal eigenvectors of the C-matrix and of the E-matrix. The most effective way would be first to make the principal singular vector of the C-matrix parallel to that of the E-matrix. A similar strategy was used to recognize protein structures by three-dimensional threading of protein sequences [13, 14]. Bastolla et al. [15] pointed out that the principal eigenvector of a contact matrix must be correlated with that of a contact energy matrix if the free energy of a conformation folded into a contact map is approximated by a pairwise contact potential. They showed that the correlation coefficients of these two principal eigenvectors are indeed statistically significant in protein folds. However, unlike their analyses, the lowest bound of the total energy indicates that the E-matrix should be singular-value decomposed with a constant term that corresponds to the threshold energy separating native contacts from non-native ones. The eigenvectors of the E-matrix may depend on the value of this additional constant.

Based on the indications above, we have analyzed the relationships among the principal eigenvectors of the C-matrix and of the E-matrix and the contact number vector by examining the inner products between pairs of these vectors. A statistical contact potential [4, 5] estimated from protein crystal structures is used to evaluate pairwise residue-residue interactions in proteins. A set of 182 representatives of single-domain proteins, one from each family in the SCOP version 1.69 database [3], is used to analyze the relationship between the principal eigenvectors of the native C- and E-matrices and the contact number vector. The results show that the inner product of the two principal eigenvectors has a maximum at a certain value of the threshold energy for contacts, and that there are parallel tendencies among the two principal eigenvectors and the contact number vector. It is worth noting that the principal eigenvector of the native C-matrix corresponds to the lower-frequency normal modes of the native structure of a protein.

In addition, the spectral representation of the contact and contact energy matrices reveals that pairwise residue-residue interactions, which depend only on the types of interacting amino acids and not on other residues in a protein, are insufficient, and that other interactions including residue connectivities and steric hindrance are needed to make native structures the unique lowest energy conformations.



II. METHODS 



Basic assumptions and conventions 

We first assume that the total conformational energy of a protein with conformation $C$ and amino acid sequence $S$ of $N$ units can be approximated as the sum of pairwise interaction energies between the units. Here a single unit may be an atom or a residue, although in most cases we treat a residue as a unit. We further assume that each pairwise interaction term can be expressed as a product of a C-dependent factor and an S-dependent factor. The C-dependent factor represents the degree to which a pair of units is in contact, while the S-dependent factor represents the interaction energy for a contacting pair of units. In other words, the total conformational energy is assumed to be approximated as

\begin{align}
E^c(C,S) &= \frac{1}{2} \sum_{i}^{N} \sum_{j}^{N} \mathcal{E}_{ij}(S)\, \Delta_{ij}(C) \tag{1} \\
&= \frac{1}{2} \sum_{i}^{N} \sum_{j}^{N} \delta\mathcal{E}_{ij}(S)\, \Delta_{ij}(C) + \varepsilon_0 N_c(C), \tag{2} \\
\delta\mathcal{E}_{ij}(S) &\equiv \mathcal{E}_{ij}(S) - \varepsilon_0, \tag{3}
\end{align}

where $\mathcal{E}_{ij}(S)$ and $\Delta_{ij}(C)$ are the S-dependent term and the C-dependent term for the pairwise interaction energy between the $i$th and $j$th units, respectively. $N_c(C)$ is the total number of contacts between units, defined as

\begin{equation}
N_c(C) \equiv \frac{1}{2} \sum_{i}^{N} n_i(C), \tag{4}
\end{equation}

where the generalized contact number $n_i$, the total number of units in contact with the $i$th unit, is defined as

\begin{equation}
n_i(C) \equiv \sum_{j}^{N} \Delta_{ij}(C). \tag{5}
\end{equation}

Each $\Delta_{ij}(C)$ is a function of the coordinates of the $i$th and $j$th units and assumes values between 0 and 1, with the diagonal elements always defined to be 0. The S-dependent term $\mathcal{E}_{ij}(S)$ can include not only two-body interactions but also multi-body effects as a mean field; that is, it may depend not only on the type of a unit pair but on the entire protein sequence. We call the matrix $\Delta(C) \equiv (\Delta_{ij}(C))$ a generalized contact matrix, or C-matrix for short. Similarly, we call the matrix $(\mathcal{E}_{ij}(S))$ a generalized contact energy matrix, or E-matrix for short. Each element of the energy function of Eq. (1) can represent either attractive or repulsive interactions but not both. In the next sections, we consider mathematical lower limits of the total contact energy, ignoring atomic details of proteins such as atom/residue connectivities and steric hindrance. Volume exclusion between atoms is assumed to be satisfied and is not included in the total energy function. To minimally reflect the effects of steric hindrance, the total number of contacts $N_c$ is explicitly treated in the evaluation of the total energy, Eq. (2), by introducing a constant $\varepsilon_0$ in Eq. (3). The expression of Eq. (1) can be regarded as a special case of Eq. (2) in which $\varepsilon_0$ is equal to zero.
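As a minimal numerical sketch of the definitions above (not part of the original analysis), the following Python fragment evaluates the total contact energy in the forms of Eq. (1) and Eq. (2) and confirms that they coincide for an arbitrary constant. The 4-unit matrices `DELTA` and `ENERGY` and all function names are illustrative assumptions, chosen only to make the bookkeeping of the contact numbers and the total number of contacts explicit.

```python
from itertools import product

def contact_numbers(delta):
    """Eq. (5): n_i = sum_j Delta_ij."""
    return [sum(row) for row in delta]

def total_contacts(delta):
    """Eq. (4): N_c = (1/2) sum_i n_i."""
    return 0.5 * sum(contact_numbers(delta))

def energy_eq1(energy, delta):
    """Eq. (1): E^c = (1/2) sum_ij E_ij * Delta_ij."""
    n = len(delta)
    return 0.5 * sum(energy[i][j] * delta[i][j]
                     for i, j in product(range(n), repeat=2))

def energy_eq2(energy, delta, eps0):
    """Eq. (2): the same energy written with dE_ij = E_ij - eps0 plus eps0 * N_c."""
    n = len(delta)
    shifted = 0.5 * sum((energy[i][j] - eps0) * delta[i][j]
                        for i, j in product(range(n), repeat=2))
    return shifted + eps0 * total_contacts(delta)

# Hypothetical toy system of 4 units: symmetric contact matrix with zero diagonal
DELTA = [[0, 0, 1, 1],
         [0, 0, 0, 1],
         [1, 0, 0, 0],
         [1, 1, 0, 0]]
# Hypothetical symmetric contact energies (arbitrary units)
ENERGY = [[0.0, -1.0, -2.0, 0.5],
          [-1.0, 0.0, -0.3, -1.5],
          [-2.0, -0.3, 0.0, 1.0],
          [0.5, -1.5, 1.0, 0.0]]

if __name__ == "__main__":
    print(contact_numbers(DELTA), total_contacts(DELTA))
    print(energy_eq1(ENERGY, DELTA), energy_eq2(ENERGY, DELTA, eps0=-0.7))
```

Because the constant only shifts energy between the two terms of Eq. (2), its value matters for the lower bounds discussed next, not for the energy of a given conformation.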

Lower bounds of the total contact energy 

Let us consider lower bounds of the total contact energy represented by Eq. (2) under the condition that each element of the C-matrix can independently take any value within $0 \le \Delta_{ij} \le 1$, irrespective of whether or not it can be reached in real protein conformations; in other words, atom/residue connectivities and steric hindrance are completely ignored.

If one regards $\delta\mathcal{E}_{ij}$ and $\Delta_{ij}$ as the elements of vectors $\delta\mathcal{E}(S)$ and $\Delta(C)$ in $N^2$-dimensional Euclidean space, it is obvious that the first term of Eq. (2) is bounded by the product of the norms of those two vectors:

\begin{equation}
E^c(C,S) \ge -\frac{1}{2} \|\delta\mathcal{E}(S)\| \, \|\Delta(C)\| + \varepsilon_0 N_c(C), \tag{6}
\end{equation}

where $\| \ldots \|$ means the Euclidean norm. The equality in Eq. (6) is achieved if and only if those vectors are anti-parallel to each other:

\begin{equation}
\delta\mathcal{E}_{ij}(S) = \varepsilon \Delta_{ij}(C), \tag{7}
\end{equation}

where $\varepsilon$ is a negative constant.

In addition, there is a simple mathematical limit for the total energy of Eq. (2), which corresponds to the C-matrix equal to $H_0(-\delta\mathcal{E}_{ij})$:

\begin{align}
E^c(C,S) &\ge \frac{1}{2} \sum_{i} \sum_{j} \delta\mathcal{E}_{ij}(S)\, \Delta_{ij}(C_{\min}) + \varepsilon_0 N_c(C_{\min}) \tag{8} \\
&\ge \frac{1}{2} \sum_{i} \sum_{j} \mathcal{E}_{ij}(S)\, H_0(-\mathcal{E}_{ij}(S)), \tag{9} \\
\Delta_{ij}(C_{\min}) &\equiv H_0(-\delta\mathcal{E}_{ij}(S)), \tag{10}
\end{align}

where $H_0(x)$ is the Heaviside step function that takes one for $x > 0$ and zero otherwise. $C_{\min}$ is the lowest energy conformation under a constraint on the total contact number $N_c$, although it is not necessarily reachable because of atom and residue connectivity and steric hindrance. If each $\Delta_{ij}$ is allowed to take either 0 or 1 only, and each $\delta\mathcal{E}_{ij}$ takes one of only two real values so that Eq. (7) can be satisfied, the lower bounds of Eq. (6) and Eq. (8) are equal to each other. Otherwise, the lower bound of Eq. (6) is further bounded by the lower bound of Eq. (8), or the equality in Eq. (6) cannot be achieved with $0 \le \Delta_{ij} \le 1$, but Eq. (8) is always satisfied. If the total number of contacts $N_c$ is constrained to be equal to $N_c(C_{\min})$, then $\varepsilon_0$ must be properly chosen as a non-positive value so that Eq. (10) is satisfied. Otherwise, $\varepsilon_0$ should be taken to be zero to obtain the lower bound of Eq. (9). Eq. (9) describes the lowest bound without any constraint on the number of contacts and corresponds to the energy of the conformation $C_{\min}$ for $\varepsilon_0 = 0$.

The potentials that satisfy Eq. (7) and Eq. (10) are just Go-like potentials [5], in which interactions between native contact pairs are always more attractive than those between non-native pairs. Let us call proteins with a Go-like potential ideal proteins. There are multiple levels of nativelikeness in the Go-like potential. The most nativelike of the present Go-like potentials is one in which all interactions between native contacts are attractive and all other interactions are repulsive; in other words, $\mathcal{E}_{ij}$ is negative for native contacts and positive for non-native contacts. With such a Go-like potential, the native conformation can attain the lowest bound of Eq. (9), which is equivalent to Eq. (8) with $\varepsilon_0 = 0$. A less nativelike potential is one in which interactions between non-native contact pairs can be attractive but are always less attractive than those between native contact pairs. An ideal protein with such a potential can attain the bound of Eq. (8) with a proper value of $\varepsilon_0$, which is the threshold energy separating native from non-native contacts. In real proteins, we should define $\varepsilon_0$ as a threshold of contact energy below which unit pairs tend to be in contact in native conformations.

In ideal proteins, the lowest energy conformation must be one for which the contact potential looks like a Go-like potential, and conversely the potential must be a Go-like potential for the lowest energy conformation. In real proteins, it would be impossible for contact potentials for native structures to be exactly a Go-like potential such as Eq. (7) and Eq. (10), even though the contact potential considered here may be an effective one that includes not only actual pairwise interactions but also the effects of higher-order interactions near native structures. In other words, the lowest bound of Eq. (8) cannot be achieved for real pairwise potentials, because of atom/residue connectivities and steric hindrance. However, it is desirable to reduce frustration among interactions, so an effective pairwise potential in native structures should approach the Go-like potential. The question is then how native contact energies approach the mathematical lowest limit. In the following, we show how the C-matrix should be designed to decrease the total energy toward the theoretical lowest limit.

It should be noted here that we consider the lowest energy conformation, the C-matrix, for a given potential, the E-matrix, and not the inverse problem, which is to find an optimum potential or an optimum sequence for a given conformation, that is, an optimum E-matrix for a given C-matrix. In the inverse problem, the total partition function varies with each sequence, and it must be taken into account to evaluate the stability of the given C-matrix relative to the other conformations [16, 17, 18, 19]. The Z-score of the energy gap between the given C-matrix and other compact conformations may be used to evaluate the optimality of each sequence [15, 20].



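Spectral relationship between C- and E-matrices

We apply singular value decomposition to both the C-matrix (generalized contact matrix) and the E-matrix (generalized contact energy matrix). The C-matrix is decomposed as

\begin{align}
\Delta_{ij}(C) &= \sum_{\mu} |\lambda_\mu(C)| \, L_{i\mu}(C) \, R_{j\mu}(C), \tag{11} \\
|\lambda_1(C)| &\ge \ldots \ge |\lambda_N(C)| \ge 0, \tag{12}
\end{align}

where $\lambda_\mu(C)$ is an eigenvalue of $\Delta(C)$, and its absolute value, $|\lambda_\mu(C)|$, is the $\mu$th non-negative singular value of $\Delta(C)$ arranged in decreasing order, and $\mathbf{L}_\mu(C) \equiv {}^t(L_{1\mu}, \ldots, L_{N\mu})$ and $\mathbf{R}_\mu(C) \equiv {}^t(R_{1\mu}, \ldots, R_{N\mu})$ are the corresponding left and right singular vectors; both $L \equiv (\mathbf{L}_1, \ldots, \mathbf{L}_N)$ and $R$ are orthonormal matrices. Note that the singular values of a symmetric matrix such as a contact matrix are equal to the absolute values of its eigenvalues. We choose the eigenvector corresponding to the eigenvalue $\lambda_\mu(C)$ as the right singular vector $\mathbf{R}_\mu(C)$; if $\lambda_\mu(C) \ge 0$, $\mathbf{L}_\mu(C) = \mathbf{R}_\mu(C)$, and otherwise $\mathbf{L}_\mu(C) = -\mathbf{R}_\mu(C)$.

Likewise, the E-matrix, $(\mathcal{E}_{ij}(S))$, is decomposed as

\begin{align}
\mathcal{E}_{ij}(S) &= \sum_{\nu} |\varepsilon_\nu(S)| \, U_{i\nu}(S) \, V_{j\nu}(S) + \varepsilon_0, \tag{13} \\
|\varepsilon_1| &\ge \ldots \ge |\varepsilon_N| \ge 0, \tag{14}
\end{align}

where the absolute value of the eigenvalue, $|\varepsilon_\nu(S)|$, and $\mathbf{U}_\nu(S) \equiv {}^t(U_{1\nu}, \ldots, U_{N\nu})$ and $\mathbf{V}_\nu(S) \equiv {}^t(V_{1\nu}, \ldots, V_{N\nu})$ are the $\nu$th singular value and the left and right singular vectors of the matrix $(\delta\mathcal{E}_{ij}(S))$, respectively. We choose the eigenvector corresponding to the eigenvalue $\varepsilon_\nu(S)$ as the right singular vector $\mathbf{V}_\nu(S)$; if $\varepsilon_\nu(S) \ge 0$, $\mathbf{U}_\nu(S) = \mathbf{V}_\nu(S)$, and otherwise $\mathbf{U}_\nu(S) = -\mathbf{V}_\nu(S)$.

We then substitute Eq. (11) and Eq. (13) into the definition of the total energy, Eq. (1), and obtain

\begin{equation}
E^c(C,S) = \frac{1}{2} \sum_{\mu} \sum_{\nu} |\lambda_\mu(C)| \, |\varepsilon_\nu(S)| \, \omega_{\mu\nu}(C,S) + \varepsilon_0 N_c(C), \tag{15}
\end{equation}

where

\begin{align}
\omega_{\mu\nu}(C,S) &\equiv \sum_{i} L_{i\mu}(C) U_{i\nu}(S) \sum_{j} R_{j\mu}(C) V_{j\nu}(S) \nonumber \\
&= {}^t\mathbf{L}_\mu(C)\mathbf{U}_\nu(S) \; {}^t\mathbf{R}_\mu(C)\mathbf{V}_\nu(S). \tag{16}
\end{align}

Because the first term in Eq. (15) is simply the trace of the product of two matrices, $\mathrm{tr}(\delta\mathcal{E}\,{}^t\Delta)$, von Neumann's trace theorem [21] leads to the following inequality:

\begin{equation}
E^c(C,S) \ge -\frac{1}{2} \sum_{\xi} |\lambda_\xi(C) \, \varepsilon_\xi(S)| + \varepsilon_0 N_c(C). \tag{17}
\end{equation}

The equality in Eq. (17) is achieved if and only if

\begin{equation}
\omega_{\mu\mu}(C,S) = -1 \quad \text{for } \{\mu \,|\, \lambda_\mu \varepsilon_\mu \ne 0\}, \tag{18}
\end{equation}

that is, all the corresponding left and right singular vectors of the C- and E-matrices are exactly parallel/anti-parallel to each other. Then, regarding the singular values as the elements of a vector, i.e., $\boldsymbol{\lambda}(C) \equiv {}^t(\lambda_1, \ldots, \lambda_N)$ and $\boldsymbol{\varepsilon}(S) \equiv {}^t(\varepsilon_1, \ldots, \varepsilon_N)$, the sum of the products of the eigenvalues of the E- and C-matrices in Eq. (17) can be bounded by the product of the norms of those two vectors, which is equal to the product of the norms of the vectors consisting of the E- or C-matrix elements. As a result, we obtain the lower bound corresponding to Eq. (6), already derived in the previous section:

\begin{align}
E^c(C,S) &\ge -\frac{1}{2} \|\boldsymbol{\lambda}(C)\|_{\{\xi | \lambda_\xi \varepsilon_\xi \ne 0\}} \, \|\boldsymbol{\varepsilon}(S)\|_{\{\xi | \lambda_\xi \varepsilon_\xi \ne 0\}} + \varepsilon_0 N_c(C) \tag{19} \\
&= -\frac{1}{2} \|\delta\mathcal{E}(S)\|_{\{\xi | \lambda_\xi \varepsilon_\xi \ne 0\}} \, \|\Delta(C)\|_{\{\xi | \lambda_\xi \varepsilon_\xi \ne 0\}} + \varepsilon_0 N_c(C), \tag{20}
\end{align}

where $\| \cdots \|_{\{\xi | \lambda_\xi \varepsilon_\xi \ne 0\}}$ means the norm in the subspace of $\lambda_\xi \varepsilon_\xi \ne 0$. The equality in Eq. (19) is achieved if and only if the eigenvalues of the C-matrix are proportional to those of the E-matrix:

\begin{equation}
\varepsilon_\xi(S) = \varepsilon \lambda_\xi(C) \quad \text{for } \{\xi \,|\, \lambda_\xi \varepsilon_\xi \ne 0\}. \tag{21}
\end{equation}

Note that $\varepsilon$ is a negative constant because of Eq. (18). This condition together with Eq. (18) corresponds to Eq. (7), but the spectral representation of the C- and E-matrices reveals that the relation of Eq. (21) is required only for the eigenspaces with $\lambda_\xi \varepsilon_\xi \ne 0$.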
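The spectral bound of Eq. (17) and its equality condition can be illustrated numerically. The sketch below uses NumPy and assumes a hypothetical random binary contact matrix for 8 units; it compares the contact energy with the bound built from the ordered singular values, once for a random symmetric energy matrix and once for a Go-like energy matrix satisfying Eq. (7) with an arbitrary negative constant of $-1.5$. The names and numerical values are illustrative assumptions only.

```python
import numpy as np

def contact_energy(delta_e, delta, eps0=0.0):
    """Eq. (2): (1/2) sum_ij dE_ij * Delta_ij + eps0 * N_c."""
    return 0.5 * np.sum(delta_e * delta) + eps0 * 0.5 * delta.sum()

def spectral_lower_bound(delta_e, delta, eps0=0.0):
    """Right-hand side of Eq. (17): singular values paired in decreasing order."""
    s_e = np.linalg.svd(delta_e, compute_uv=False)  # |eps_xi|, sorted descending
    s_c = np.linalg.svd(delta, compute_uv=False)    # |lambda_xi|, sorted descending
    return -0.5 * np.dot(s_c, s_e) + eps0 * 0.5 * delta.sum()

# Hypothetical random symmetric 0/1 contact matrix with zero diagonal
rng = np.random.default_rng(0)
N = 8
upper = np.triu((rng.random((N, N)) < 0.4).astype(float), k=1)
DELTA = upper + upper.T

# Case 1: a random symmetric energy matrix (generic, frustrated)
noise = np.triu(rng.normal(size=(N, N)), k=1)
DELTA_E_RANDOM = noise + noise.T

# Case 2: a Go-like energy matrix, Eq. (7), with an arbitrary negative constant
DELTA_E_GO = -1.5 * DELTA

E_RANDOM = contact_energy(DELTA_E_RANDOM, DELTA)
B_RANDOM = spectral_lower_bound(DELTA_E_RANDOM, DELTA)
E_GO = contact_energy(DELTA_E_GO, DELTA)
B_GO = spectral_lower_bound(DELTA_E_GO, DELTA)

if __name__ == "__main__":
    print("random:  energy %.4f >= bound %.4f" % (E_RANDOM, B_RANDOM))
    print("Go-like: energy %.4f == bound %.4f" % (E_GO, B_GO))
```

For the Go-like case the two matrices share eigenvectors and have eigenvalues of opposite sign, so the bound is attained; for the random case the energy stays strictly above it in general.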



Is a pairwise residue-residue potential sufficient to make native structures the unique lowest energy conformations?

If there exists $\xi$ such that $\varepsilon_\xi = 0$, and the C-matrices for two conformations $C$ and $C'$ satisfy $({}^tU(\Delta(C) - \Delta(C'))V)_{\xi\xi} = 0$ for $\{\xi \,|\, \varepsilon_\xi \ne 0\}$ and $N_c(C) = N_c(C')$, those two conformations have the same conformational energy, because the total contact energy can be represented as

\begin{equation}
E^c(C,S) = \frac{1}{2} \sum_{\nu} |\varepsilon_\nu| \, ({}^tU \Delta(C) V)_{\nu\nu} + \varepsilon_0 N_c(C). \tag{22}
\end{equation}

If the contact interactions are genuinely two-body interactions between residues, $\mathcal{E}_{ij}(S)$ and $\delta\mathcal{E}_{ij}(S)$ depend only on the residue types of the $i$th and $j$th units, and therefore $\mathrm{rank}(\delta\mathcal{E}_{ij})$ is less than or equal to the number of amino acid types in the protein; hence $\mathrm{rank}(\delta\mathcal{E}_{ij}) \le 20$. Thus, in the case of genuine two-body interactions between residues, there must exist $\xi$ such that $\varepsilon_\xi = 0$ for any chain longer than 20 residues, and therefore multiple C-matrices with the same energy. In other words, interactions other than pairwise interactions are needed to make native structures the unique lowest energy conformations. A certain success [22] of genuine two-body statistical potentials in identifying native structures as the unique lowest energy conformations indicates that most of the eigenspaces with $\varepsilon_\xi = 0$, especially for orientation-dependent potentials, may be significantly reduced or even disallowed for short proteins by atom/residue connectivities and steric hindrance. It may be worth noting that the number of possible C-matrices is of the order of $2^{N(N-1)/2}$, whereas the conformational entropy of self-avoiding chains is at most proportional to $N$, where $N$ is the chain length; that is, a vast region of conformational space is disallowed by chain connectivity and steric hindrance. However, it would not be surprising if a two-body contact potential were insufficient to make all native structures the unique lowest energy conformations, especially for long amino acid sequences. Indeed, it was reported [26, 27, 25] that it is impossible to optimize a pairwise potential to identify all native structures. Multi-body interactions [23] may be required as a mean field or even explicitly together with the two-body interactions, as well as other interactions such as secondary structure potentials [24].
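The rank argument above is easy to confirm numerically. The following NumPy sketch assumes a hypothetical random symmetric $20 \times 20$ table of type-type contact energies and a hypothetical random sequence of 60 residues; it builds the sequence-level matrix whose element for residues $i$ and $j$ is the table entry for their types, and counts the zero singular values that remain, with and without subtracting a constant threshold. The table, the sequence, and all names are assumptions made for illustration, and the diagonal is filled by the same rule for simplicity.

```python
import numpy as np

def sequence_energy_matrix(type_energies, sequence):
    """E_ij = e[a_i, a_j]: a genuine two-body potential depends on residue types only."""
    idx = np.asarray(sequence)
    return type_energies[np.ix_(idx, idx)]

rng = np.random.default_rng(1)
N_TYPES, N_RES = 20, 60

# Hypothetical symmetric table of contact energies between residue types
half = rng.normal(size=(N_TYPES, N_TYPES))
TYPE_ENERGIES = (half + half.T) / 2.0

# Hypothetical random sequence of residue-type indices
SEQUENCE = rng.integers(0, N_TYPES, size=N_RES)

E_MATRIX = sequence_energy_matrix(TYPE_ENERGIES, SEQUENCE)
RANK = int(np.linalg.matrix_rank(E_MATRIX))

# Subtracting a constant threshold eps0 keeps the type-only dependence
RANK_SHIFTED = int(np.linalg.matrix_rank(E_MATRIX - (-0.5)))

N_ZERO_SINGULAR_VALUES = N_RES - RANK

if __name__ == "__main__":
    print("rank:", RANK, "rank after shift:", RANK_SHIFTED,
          "zero singular values:", N_ZERO_SINGULAR_VALUES)
```

For any chain longer than the number of residue types, at least the difference between chain length and number of types singular values vanish, which is the degeneracy the text refers to.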



Relationship between a contact number vector n and the eigenvectors of C-matrix

Eq. (17) indicates that the larger the principal eigenvalue is, the lower the lower bound of the total contact energy is. The eigenvalue $\lambda_\mu$ satisfies

\begin{align}
\lambda_\mu(C) &= \frac{{}^t\mathbf{R}_\mu(C)\,\mathbf{n}(C)}{{}^t\mathbf{R}_\mu(C) \cdot \mathbf{1}} \tag{23} \\
&= \langle n_i^2 \rangle^{1/2} \, \frac{{}^t\mathbf{R}_\mu \mathbf{n} / \|\mathbf{n}\|}{{}^t\mathbf{R}_\mu \mathbf{1} / \|\mathbf{1}\|}, \tag{24}
\end{align}

where ${}^t\mathbf{R}_\mu \mathbf{n} / \|\mathbf{n}\|$ is the cosine of the angle between the contact number vector $\mathbf{n}$ and the eigenvector $\mathbf{R}_\mu$, and ${}^t\mathbf{R}_\mu \mathbf{1} / \|\mathbf{1}\|$ is that between the eigenvector $\mathbf{R}_\mu$ and the vector $\mathbf{1}$ whose elements are all equal to one. $\langle n_i^2 \rangle$ represents the second moment of contact numbers over all units. We can say that the eigenvalue $\lambda_\mu$ is equal to the average of the contact numbers $n_i$ weighted by the components of the eigenvector, $R_{i\mu}$, and also that it is roughly proportional to the square root of the second moment of contact numbers. The principal eigenvalue has a value within the range $2N_c/N \le \lambda_1 \le \max_i n_i$ [29]. The larger the ratio of the cosines, ${}^t\mathbf{R}_\mu \mathbf{n} \, \|\mathbf{1}\| / ({}^t\mathbf{R}_\mu \mathbf{1} \, \|\mathbf{n}\|)$, is, the larger the eigenvalue $\lambda_\mu$ becomes.

Relationship between a contact number vector n and the eigenvectors of E-matrix

A contact number vector is the C-matrix summed over a row or a column. Thus, to obtain a relationship between the contact number vector, $\mathbf{n}$, and the principal eigenvector of the E-matrix, an averaging of the E-matrix over a row or a column is needed.

We approximate the total contact energy by replacing $\delta\mathcal{E}_{ij}$ by its average over the index $j$, $\delta\mathcal{E}_{i\cdot}$, and then obtain an approximate expression for the lower bound of the total contact energy:

\begin{align}
E^c(C,S) &\approx \frac{1}{2} \sum_{i} \sum_{j} \Big[ \frac{1}{N} \sum_{k} \delta\mathcal{E}_{ik}(S) \Big] \Delta_{ij}(C) + \varepsilon_0 N_c(C) \tag{25} \\
&= \frac{1}{2} \, {}^t\delta\boldsymbol{\mathcal{E}}_{\cdot}(S) \, \mathbf{n}(C) + \varepsilon_0 N_c(C) \tag{26} \\
&\ge -\frac{1}{2} \|\delta\boldsymbol{\mathcal{E}}_{\cdot}(S)\| \, \|\mathbf{n}(C)\| + \varepsilon_0 N_c(C), \tag{27}
\end{align}

where the mean contact energy vector $\delta\boldsymbol{\mathcal{E}}_{\cdot}$ is defined as $\delta\boldsymbol{\mathcal{E}}_{\cdot}(S) \equiv (\frac{1}{N} \sum_k \delta\mathcal{E}_{ik}(S))$. The equality in Eq. (27) holds if and only if the two vectors $\delta\boldsymbol{\mathcal{E}}_{\cdot}$ and $\mathbf{n}$ are anti-parallel:

\begin{equation}
\frac{\delta\boldsymbol{\mathcal{E}}_{\cdot}(S)}{\|\delta\boldsymbol{\mathcal{E}}_{\cdot}(S)\|} = -\frac{\mathbf{n}(C)}{\|\mathbf{n}(C)\|}. \tag{28}
\end{equation}

Eq. (28) is equivalent to the following relation between the contact number vector and the eigenvectors of the E-matrix:

\begin{equation}
\frac{{}^t\mathbf{V}_\xi \mathbf{n} \, \|\mathbf{1}\|}{{}^t\mathbf{V}_\xi \mathbf{1} \, \|\mathbf{n}\|} = -\frac{\varepsilon_\xi}{\big( \sum_{\nu} (\varepsilon_\nu \, {}^t\mathbf{V}_\nu \mathbf{1} / \|\mathbf{1}\|)^2 \big)^{1/2}}. \tag{29}
\end{equation}

If the E-matrix can be well approximated by the primary eigenvector term only, then this condition leads to a parallel orientation between $\mathbf{n}$ and the primary eigenvector of the E-matrix, that is, ${}^t\mathbf{V}_1 \mathbf{n} / \|\mathbf{n}\| \approx 1$. It has been reported that the contact number vector is highly correlated with the principal eigenvector of the C-matrix [7, 8].

If the conformation for the lower bound of the total energy is also the lower-bound conformation for this averaging over the E-matrix, Eq. (28) or Eq. (29) above together with Eq. (23), Eq. (24), $\mathbf{n} = \sum_\mu \lambda_\mu \mathbf{R}_\mu ({}^t\mathbf{R}_\mu \mathbf{1})$, and $\delta\boldsymbol{\mathcal{E}}_{\cdot} = \frac{1}{N} \sum_\nu \varepsilon_\nu \mathbf{V}_\nu ({}^t\mathbf{V}_\nu \mathbf{1})$ leads to Eq. (21) between the eigenvalues of the C- and E-matrices as follows:

\begin{align}
\varepsilon_\xi &= -\frac{\big( \sum_{\nu} (\varepsilon_\nu \, {}^t\mathbf{V}_\nu \mathbf{1} / \|\mathbf{1}\|)^2 \big)^{1/2}}{\big( \sum_{\mu} (\lambda_\mu \, {}^t\mathbf{R}_\mu \mathbf{1} / \|\mathbf{1}\|)^2 \big)^{1/2}} \, \lambda_\xi \quad \text{if } \mathbf{R}_\xi = \pm \mathbf{V}_\xi \tag{30} \\
&= \varepsilon \lambda_\xi \quad \text{with a negative constant, } \varepsilon < 0, \tag{31}
\end{align}

where $\varepsilon$ is a constant taking any negative value.


III. DATA ANALYSES

Eq. (17) indicates that, with an optimum value for $\varepsilon_0$, the spectral relationship of Eq. (18) between the E- and C-matrices tends to be satisfied in the lowest energy conformations. Here we examine this by crudely evaluating pairwise interactions with a contact potential between amino acids, which was estimated as a statistical potential from contact frequencies between amino acids observed in protein crystal structures.

A pairwise contact potential used

The contact potential used is a statistical estimate [5] of contact energies with a correction [2] for the Bethe approximation [30, 31]. The contact energy between amino acids of type $a$ and $b$ was estimated as

\begin{equation}
e_{ab} = e_{rr} + \alpha' \Big[ \Delta e_{ar}^{\mathrm{Bethe}} + \Delta e_{rb}^{\mathrm{Bethe}} + \frac{\beta'}{\alpha'} \, \delta e_{ab}^{\mathrm{Bethe}} \Big]. \tag{32}
\end{equation}

$e_{rr}$ is the part of contact energies that is independent of residue types and is called a collapse energy; it is essential for a protein to fold, by cancelling out the large conformational entropy of extended conformations, but cannot be estimated explicitly from contact frequencies between amino acids in protein structures. $\Delta e_{ar}^{\mathrm{Bethe}}$ and $\delta e_{ab}^{\mathrm{Bethe}}$ are the values of $\Delta e_{ar}$ and $\delta e_{ab}$ evaluated by the Bethe approximation from the observed numbers of contacts between amino acids. $\Delta e_{ar} + e_{rr}$ is a partition energy or hydrophobic energy for a residue of type $a$. $\delta e_{ab}$ is an intrinsic contact energy for a contact between residues of type $a$ and $b$; refer to [3] for their exact definitions. The proportionality constants for the correction were estimated as $\beta'/\alpha' = 2.2$ and $\alpha' \le 1$ [4]. Here energy is measured in $kT$ units; $k$ is the Boltzmann constant and $T$ is temperature. With the spectral expansion of the second term of Eq. (32), the contact energies can be represented by

\begin{equation}
e_{ab} = e_{rr} + \alpha' \Big[ e_0 + \sum_{\nu} e_\nu Q_{a\nu} Q_{b\nu} \Big], \tag{33}
\end{equation}

where $e_\nu$ and $Q_{a\nu}$ are eigenvalues and eigenvectors for the second term of Eq. (32) with a constant $e_0$. Li et al. [32] showed that the contact potential [30, 31] corresponding to $\beta'/\alpha' = 1$ between residues can be well approximated by the principal eigenvector term together with a constant term.

Then, the following relationship is derived for the eigenvalues and eigenvectors between the E-matrix and the contact energy matrix $(e_{ab})$:

\begin{align}
\varepsilon_0 &= e_{rr} + \alpha' e_0, \tag{34} \\
\varepsilon_\nu &\approx \alpha' e_\nu \sum_{i}^{N} Q_{a_i \nu}^2, \tag{35} \\
V_{i\nu} &\approx Q_{a_i \nu} \Big/ \Big( \sum_{i}^{N} Q_{a_i \nu}^2 \Big)^{1/2}, \tag{36}
\end{align}

where $a_i$ is the amino acid type of the $i$th residue, and $N$ is the protein length. It should be noted here that the eigenvectors $V_{i\nu}$ do not depend on the value of $\alpha'$.

The C-matrix, $\Delta(C)$, is defined in such a way that non-diagonal elements take the value one for residues that are completely in contact, the value zero for residues that are too far from each other, and values between one and zero for residues whose distance is intermediate between those two extremes. Contacts between neighboring residues are completely ignored, that is, $\Delta_{ij} = 0$ for $|i - j| \le 1$. The geometric center of side chain heavy atoms, or the C$^\alpha$ atom for glycine, is used to represent each residue. Previously, this function was defined as a step function for simplicity. Here, it is defined as a switching function as follows; in the equations below defining residue contacts, $\mathbf{r}_i$ means the position vector of the geometric center of side chain heavy atoms or of the C$^\alpha$ atom for glycine:

\begin{align}
\Delta(\mathbf{r}_i, \mathbf{r}_j) &\equiv S_w(|\mathbf{r}_i - \mathbf{r}_j|, d_1^c, d_2^c), \tag{37} \\
S_w(x, a, b) &\equiv
\begin{cases}
1 & \text{for } x \le a \\
\big[ (b^2 - x^2)^2 / (b^2 - a^2)^3 \big] \big[ 3(b^2 - a^2) - 2(b^2 - x^2) \big] & \text{for } a < x < b \\
0 & \text{for } b \le x
\end{cases} \tag{38}
\end{align}

where $S_w$ is a switching function that sharply changes its value from one to zero between the lower distance $d_1^c$ and the upper distance $d_2^c$. Those critical distances $d_1^c$ and $d_2^c$ are taken here as 6.65 Å and 7.35 Å, respectively.
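To make the contact definition concrete, the following NumPy sketch implements the switching function of Eq. (38) with the critical distances quoted above, builds a C-matrix from a set of residue positions as in Eq. (37), and checks the relation of Eq. (23) between the principal eigenvalue and the contact number vector. The coordinates are hypothetical random points standing in for side chain centers, and the function names are illustrative assumptions, not the code used for the analyses reported here.

```python
import numpy as np

D1, D2 = 6.65, 7.35  # critical distances in angstroms, as quoted in the text

def switching(x, a=D1, b=D2):
    """Eq. (38): one for x <= a, zero for x >= b, smooth polynomial in between."""
    x = np.asarray(x, dtype=float)
    mid = ((b**2 - x**2) ** 2 / (b**2 - a**2) ** 3) \
        * (3.0 * (b**2 - a**2) - 2.0 * (b**2 - x**2))
    return np.where(x <= a, 1.0, np.where(x >= b, 0.0, mid))

def contact_matrix(coords):
    """Eq. (37), with Delta_ij = 0 for |i - j| <= 1 (self and chain neighbours)."""
    coords = np.asarray(coords, dtype=float)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    delta = switching(dist)
    idx = np.arange(len(coords))
    delta[np.abs(idx[:, None] - idx[None, :]) <= 1] = 0.0
    return delta

def principal_eigenpair(delta):
    """Largest eigenvalue and its eigenvector, oriented to have a positive sum."""
    vals, vecs = np.linalg.eigh(delta)
    k = int(np.argmax(vals))
    r1 = vecs[:, k]
    return float(vals[k]), (r1 if r1.sum() >= 0 else -r1)

# Hypothetical compact random coordinates (angstroms) standing in for residue centers
rng = np.random.default_rng(0)
COORDS = rng.normal(scale=5.0, size=(30, 3))

DELTA = contact_matrix(COORDS)
CONTACT_NUMBERS = DELTA.sum(axis=1)   # Eq. (5)
N_C = 0.5 * CONTACT_NUMBERS.sum()     # Eq. (4)
LAMBDA_1, R_1 = principal_eigenpair(DELTA)

# Eq. (23): lambda_1 = (R_1 . n) / (R_1 . 1)
LAMBDA_FROM_N = float(R_1 @ CONTACT_NUMBERS) / float(R_1.sum())

if __name__ == "__main__":
    print(LAMBDA_1, LAMBDA_FROM_N, 2.0 * N_C / len(COORDS), CONTACT_NUMBERS.max())
```

Because the switching function is continuous, the C-matrix elements and hence the eigenvalues vary smoothly with the coordinates, which a step function would not guarantee.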

Protein structures analyzed 

Proteins, each of which is a single-domain protein representing a different family of protein folds, were collected. In the case of multi-domain proteins in which contacts between domains are significantly fewer than those within domains, a contact matrix could be approximated by a direct sum of subspaces corresponding to each domain. This characteristic of multi-domain proteins has been used for domain decomposition [33] and for identification of side-chain clusters in a protein [34, 35]. Thus, only single-domain proteins are used here. Release 1.69 of the SCOP database [3] was used for the classification of protein folds. We have assumed that proteins whose domain specifications in the SCOP database consist of a protein ID only are single-domain proteins. Representatives of families are the first entries in the protein lists for each family in SCOP; if these first proteins are not appropriate (see below) for the present purpose, then the second ones are chosen. These representatives are all those belonging to protein classes 1 to 4, that is, the classes of all-$\alpha$, all-$\beta$, $\alpha/\beta$, and $\alpha+\beta$ proteins. The classes of multi-domain proteins, membrane and cell surface proteins, small proteins, peptides, and designed proteins are not used. Proteins whose structures [36] were determined by NMR or have stated resolutions worse than 2.0 Å are removed to ensure that the quality of the structures used is high. Also, proteins whose coordinate sets consist only of C$^\alpha$ atoms, include many unknown residues, or lack many atoms or residues are removed. In addition, proteins shorter than 50 residues are removed. As a result, the set of family representatives includes 182 protein domains.


IV. RESULTS 

The spectral relationship between C-matrix and E-matrix is analyzed for single-domain proteins that are representatives from each family of classes 1 to 4 in version 1.69 of the SCOP database. The statistical potential used is crude, so that the following analyses are limited to relationships among the principal eigenvectors of C-matrix and of E-matrix and the contact number vector. It should be noted here that the crude evaluation of the pairwise interactions may make their relationships unclear.

Eq. (24) indicates that the eigenvalues of C-matrix are proportional to the square root of the second moment of contact numbers. The proportionality constant for the principal eigenvalue of C-matrix, that is, $({}^t\mathbf{R}_1 \mathbf{n} \, \|\mathbf{1}\|) / ({}^t\mathbf{R}_1 \mathbf{1} \, \|\mathbf{n}\|)$, is plotted for each protein in Fig. 1. The dotted lines are iso-cosine lines for the angle between the principal eigenvector of C-matrix and the contact number vector, whose values are written in the figure. The ratios are scattered between 1.2 and 1.6, although the value of the ratio depends on the value of the abscissa, ${}^t\mathbf{R}_1 \mathbf{1} / \|\mathbf{1}\|$. A cosine is bounded above by one, and therefore the ratio of the cosines is correlated with the denominator of the ratio, i.e., ${}^t\mathbf{R}_1 \mathbf{1} / \|\mathbf{1}\|$. The important fact is that the ratio takes values larger than one, making the principal eigenvalue larger. Here, it should be noted that the lower bound of conformational energy depends linearly on the principal eigenvalue of C-matrix; see Eq. (17). Thus, the larger the principal eigenvalue is, the lower the conformational energy becomes. In practice, this condition seems to yield the high correlation between the principal eigenvector and the contact number vector; most values of ${}^t\mathbf{R}_1 \mathbf{n} / \|\mathbf{n}\|$ are greater than 0.7.

FIG. 1: The ratio of ${}^t\mathbf{R}_1 \mathbf{n} / \|\mathbf{n}\|$ to ${}^t\mathbf{R}_1 \mathbf{1} / \|\mathbf{1}\|$ is shown for each of 182 proteins, which are representatives of single-domain proteins from each family of classes 1 to 4 in SCOP version 1.69; the abscissa is ${}^t\mathbf{R}_1 \mathbf{1} / \|\mathbf{1}\|$. $\mathbf{R}_1$ and $\mathbf{n}$ are the principal eigenvector and the contact number vector of the native C-matrix, respectively. The dotted lines indicate the iso-value lines for ${}^t\mathbf{R}_1 \mathbf{n} / \|\mathbf{n}\|$, whose values are shown in the figure.
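
The ratio discussed here can be checked numerically on a toy example. The sketch below is illustrative only and rests on several assumptions that are not part of the analysis in this paper: the "structure" is a hypothetical 4 x 4 x 4 cubic lattice with 3.8 Å spacing, contacts are defined by a hard 7.0 Å cutoff instead of the switching function, and the principal eigenvector is obtained by a simple power iteration. It verifies that the principal eigenvalue equals the ratio of the two cosines multiplied by the square root of the second moment of contact numbers, which follows from the contact number vector being the C-matrix applied to the vector of ones.

```python
import math


def lattice_coords(side=4, spacing=3.8):
    """Hypothetical compact 'structure': points on a side^3 cubic lattice."""
    r = range(side)
    return [(spacing * x, spacing * y, spacing * z) for x in r for y in r for z in r]


def contact_matrix(coords, cutoff=7.0):
    """Binary C-matrix: 1 for distinct units closer than the cutoff, else 0."""
    n = len(coords)
    return [[1.0 if i != j and math.dist(coords[i], coords[j]) < cutoff else 0.0
             for j in range(n)] for i in range(n)]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def principal_eigen(c, iters=300):
    """Power iteration on C + I (the shift avoids oscillation); returns (lambda_1, R_1)."""
    v = [1.0] * len(c)
    for _ in range(iters):
        w = [a + b for a, b in zip(matvec(c, v), v)]
        s = math.sqrt(dot(w, w))
        v = [x / s for x in w]
    return dot(v, matvec(c, v)), v


C = contact_matrix(lattice_coords())
ones = [1.0] * len(C)
n = matvec(C, ones)                                       # contact number vector, n = C 1
lam1, R1 = principal_eigen(C)

cos_R1_n = dot(R1, n) / math.sqrt(dot(n, n))              # tR1 n / ||n||
cos_R1_1 = dot(R1, ones) / math.sqrt(dot(ones, ones))     # tR1 1 / ||1||
ratio = cos_R1_n / cos_R1_1
second_moment_root = math.sqrt(dot(n, n) / len(n))        # ||n|| / ||1||

print(f"lambda_1 = {lam1:.4f}, ratio * sqrt(<n^2>) = {ratio * second_moment_root:.4f}")
print(f"cos(R1, n) = {cos_R1_n:.3f}, cos(R1, 1) = {cos_R1_1:.3f}, ratio = {ratio:.3f}")
```

For real proteins the contact matrix would be built from native coordinates with the switching function, and a library eigensolver would normally replace the power iteration.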



FIG. 2: The mean of ${}^t\mathbf{R}_1 \mathbf{V}_1$ over 182 proteins is plotted with plus marks against $\varepsilon_0$. These proteins are representatives of single-domain proteins from each family of classes 1 to 4 in SCOP version 1.69. $\mathbf{R}_1$ is the principal eigenvector of the native C-matrix. $\mathbf{V}_1$ is the principal eigenvector of the E-matrix with the value of $\varepsilon_0$ specified on the abscissa.

Now let us think about the relationship between the 
C-matrix and the pairwise interactions. Pairwise interac- 
tions between residues are evaluated by using a statistical 
estimate [5] of contact energies with a correction [J] for 
the Bethe approximation. Figure 2 shows the average of ${}^t\mathbf{R}_1 \mathbf{V}_1$ over all proteins for each value of $\varepsilon_0$. The average $\langle {}^t\mathbf{R}_1 \mathbf{V}_1 \rangle$ takes the maximum value 0.699 at $\varepsilon_0 = 1.3$, although it decreases only slightly as $\varepsilon_0$ increases. In the following, $\varepsilon_0 = 1.3$ is used to calculate the eigenvectors of E-matrices.

FIG. 3: The value of ${}^t\mathbf{R}_1 \mathbf{V}_1$ is plotted against ${}^t\mathbf{R}_1 \mathbf{1} / \|\mathbf{1}\|$ for each of 182 proteins, which are representatives of single-domain proteins from each family of classes 1 to 4 in SCOP version 1.69. $\mathbf{R}_1$ is the principal eigenvector of the native C-matrix. $\mathbf{V}_1$ is the principal eigenvector of the E-matrix with $\varepsilon_0 = 1.3$. The dotted line shows the line of equal values between the ordinate and abscissa.

The value of ${}^t\mathbf{R}_1 \mathbf{V}_1$ for each protein is plotted against the value of ${}^t\mathbf{R}_1 \mathbf{1} / \|\mathbf{1}\|$ in Fig. 3. The value of ${}^t\mathbf{R}_1 \mathbf{V}_1$ is larger for most of the proteins than that of ${}^t\mathbf{R}_1 \mathbf{1} / \|\mathbf{1}\|$. If the direction of $\mathbf{R}_1$ is randomly distributed in the domain of $R_{i1} > 0$, the probability that ${}^t\mathbf{R}_1 \mathbf{V}_1$ is larger than ${}^t\mathbf{R}_1 \mathbf{1} / \|\mathbf{1}\|$ must be smaller than 0.5. Then, in such a random distribution, the probability of observing Fig. 3, in which 175 of 182 proteins fall into the region of ${}^t\mathbf{R}_1 \mathbf{V}_1 > {}^t\mathbf{R}_1 \mathbf{1} / \|\mathbf{1}\|$, must be smaller than ${}_{182}C_{175} (0.5)^{175} = \exp(-91.6)$. Also, a t-test is performed for the correlation coefficient between $\mathbf{R}_1$ and $\mathbf{V}_1$ for each protein. The geometric mean of the probabilities for significance over the 182 proteins examined here is equal to $\exp(-18.4)$. Thus, it is statistically significant that the direction of the vector $\mathbf{R}_1$ is closer to $\mathbf{V}_1$ than to $\mathbf{1}$, whose elements do not depend on residues in proteins. This fact indicates again that the parallel orientation between the principal eigenvectors of C-matrix and of E-matrix is favored.

Eq. (28) indicates that the conformational energy is favorably decreased when the mean contact energy vector $\delta\boldsymbol{\varepsilon}_{\cdot}$, whose $i$th element is the mean of $\delta\varepsilon_{ik}$ over $k$, is anti-parallel to the contact number vector. Figure 4 shows a weak but statistically significant tendency for the value of $-{}^t\delta\boldsymbol{\varepsilon}_{\cdot} \mathbf{n} / (\|\delta\boldsymbol{\varepsilon}_{\cdot}\| \, \|\mathbf{n}\|)$ to be larger than ${}^t\mathbf{n} \mathbf{1} / (\|\mathbf{n}\| \, \|\mathbf{1}\|)$; in t-tests for the correlation coefficients between $\delta\boldsymbol{\varepsilon}_{\cdot}$ and $\mathbf{n}$, the geometric mean of the probabilities for significance over 182 proteins is equal to $\exp(-27.9)$. If the E-matrix can be approximated by the primary eigenvector term, this fact indicates that the contact number vector tends to be parallel to the principal eigenvector of the E-matrix. Actually this is the case for the present estimate of contact energies; the figure of ${}^t\mathbf{V}_1 \mathbf{n} / \|\mathbf{n}\|$ versus ${}^t\mathbf{n} \mathbf{1} / (\|\mathbf{n}\| \, \|\mathbf{1}\|)$ is not shown. In t-tests for the correlation coefficients between $\mathbf{V}_1$ and $\mathbf{n}$, the geometric mean of the probabilities for significance is equal to $\exp(-28.8)$.

FIG. 4: The value of $-{}^t\delta\boldsymbol{\varepsilon}_{\cdot} \mathbf{n} / (\|\delta\boldsymbol{\varepsilon}_{\cdot}\| \, \|\mathbf{n}\|)$ is plotted against ${}^t\mathbf{n} \mathbf{1} / (\|\mathbf{n}\| \, \|\mathbf{1}\|)$ for each of 182 proteins, which are representatives of single-domain proteins from each family of classes 1 to 4 in SCOP version 1.69. $\varepsilon_0 = 1.3$ is used for the E-matrix. The dotted line shows the line of equal values between the ordinate and abscissa.

Here, we have shown that the principal eigenvector, among the eigenvectors of C-matrix, seems to be a main contributor to minimizing the conformational energy. It is important to note that the principal eigenvector of C-matrix corresponds to the lower frequency normal modes of protein motion. Let us think about a Kirchhoff matrix whose $(i, j)$ element is defined as

$$
\delta_{ij} n_i - \Delta_{ij}
\tag{39}
$$

where $\delta_{ij}$ is the Kronecker delta. The eigenvalues of the Kirchhoff matrix are equal to the squares of the normal mode angular frequencies in a system in which the $i$th and $j$th units are connected to each other by a spring with a spring constant equal to $\Delta_{ij}$. If the contact number $n_i$ is equal to a constant $n_c$ irrespective of unit $i$, then the eigenvalues of the Kirchhoff matrix are equal to $n_c - \lambda_\mu$. In other words, in this case the principal eigenvector of C-matrix corresponding to the largest eigenvalue is equal to the eigenvector of the Kirchhoff matrix corresponding to the smallest eigenvalue, that is, the lowest frequency normal mode corresponding to a motion that leads to a large conformational change [37]. In actual proteins, the contact number $n_i$ depends on unit $i$, and then the correspondence between the eigenvectors of C-matrix and those of the Kirchhoff matrix would become vague, but it is expected that the principal eigenvector of C-matrix belongs to a subspace consisting of lower frequency normal modes.

FIG. 5: The norms of C-matrix eigenvectors, $\mathbf{R}_\mu$, projected on the subspace consisting of the $n$ lowest normal modes of the Kirchhoff matrix corresponding to the C-matrix are plotted against $n$. $P_n$ means a projection operator on the $n$ lowest normal modes of the Kirchhoff matrix. Plus marks indicate the norm of the principal eigenvector of the C-matrix of each of 182 proteins projected on each subspace consisting of the $n$ lowest normal modes indicated on the abscissa. The solid curves with cross marks indicate those norms averaged over all proteins; the curves from left to right show those values for the first, the second, and the third principal eigenvectors of C-matrix, respectively.
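
The constant-contact-number case can be illustrated with a small numerical check. The following sketch is hypothetical and is not taken from the analysis in this paper: it assumes a ring of 12 units in which every unit contacts its first and second neighbors, so that every contact number equals 4, and it takes the Kirchhoff matrix to be the diagonal matrix of contact numbers minus the contact matrix. It confirms that the two matrices share eigenvectors and that each Kirchhoff eigenvalue is the constant contact number minus the corresponding C-matrix eigenvalue.

```python
import math


def ring_contact_matrix(n_units=12, shifts=(1, 2)):
    """Hypothetical C-matrix of a ring in which unit i contacts units i+s and i-s."""
    c = [[0.0] * n_units for _ in range(n_units)]
    for i in range(n_units):
        for s in shifts:
            c[i][(i + s) % n_units] = 1.0
            c[i][(i - s) % n_units] = 1.0
    return c


def kirchhoff(c):
    """Kirchhoff matrix: contact numbers on the diagonal minus the contact matrix."""
    size = len(c)
    return [[(sum(c[i]) if i == j else 0.0) - c[i][j] for j in range(size)]
            for i in range(size)]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def max_residual(m, v, eigenvalue):
    """Largest |(m v - eigenvalue v)_i|; zero when v is an eigenvector of m."""
    return max(abs(mv - eigenvalue * x) for mv, x in zip(matvec(m, v), v))


N_UNITS, SHIFTS = 12, (1, 2)
C = ring_contact_matrix(N_UNITS, SHIFTS)
K = kirchhoff(C)
n_c = sum(C[0])  # constant contact number shared by every unit

for k in range(N_UNITS // 2 + 1):
    # Cosine waves are eigenvectors of any circulant contact matrix.
    v = [math.cos(2.0 * math.pi * k * i / N_UNITS) for i in range(N_UNITS)]
    lam = sum(2.0 * math.cos(2.0 * math.pi * k * s / N_UNITS) for s in SHIFTS)
    print(k, round(lam, 3), round(n_c - lam, 3),
          max_residual(C, v, lam) < 1e-9, max_residual(K, v, n_c - lam) < 1e-9)
```

The mode with the largest C-matrix eigenvalue is the one with the smallest Kirchhoff eigenvalue, which is the correspondence the argument above relies on.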

In Fig. 5, plus marks indicate the norm of the principal eigenvector of the C-matrix of each of 182 proteins projected on each subspace consisting of the $n$ lowest normal modes indicated on the abscissa. In most of the proteins, the principal eigenvector of C-matrix corresponds to the lower frequency modes of the Kirchhoff matrix. The solid curves with cross marks indicate those norms averaged over all proteins; the curves from left to right show those values for the first, the second, and the third principal eigenvectors of C-matrix, respectively. The solid curve for the principal eigenvector shows that about 70% of the principal eigenvector of the C-matrix can be explained by only the 10 lowest frequency modes. Thus, the principal eigenvector of C-matrix is not only an important contributor to minimizing the conformational energy, but also corresponds to the lower frequency normal modes of protein motion.



V. DISCUSSION 

The lower bounds of the total contact energy lead to a relationship between E- and C-matrices such that the contact potential looks like a Go-like potential. Such a relationship may be realized only for ideal proteins, but in real proteins, atom and residue connectivity and steric hindrance not included in the contact energy can significantly reduce conformational space; the number of possible C-matrices is of the order of $2^{N(N-1)/2}$, but the conformational entropy of self-avoiding chains is proportional to at most $N$, where $N$ is chain length. As a result, Eq. (18) is expected to be approximately satisfied only for some singular spaces, probably for singular values taking relatively large values, but at least for the primary singular space. It was confirmed in the representative proteins that the inner products of the principal eigenvectors of E- and C-matrices are significantly biased toward the value one at a certain value of the threshold energy $\varepsilon_0$ for contacts, where their average over all proteins has a maximum; see Fig. 3. Parallel relationships were also indicated and confirmed between the primary eigenvector $\mathbf{R}_1$ and the contact number vector $\mathbf{n}$ of C-matrix, and between the mean contact energy vector $\delta\boldsymbol{\varepsilon}_{\cdot}$ and the contact number vector $\mathbf{n}$; see Fig. 1 and Fig. 4. In these analyses, a statistical potential was used to evaluate contact energies between residues, and the coarseness of these evaluations limits the present analysis to a relationship between the primary eigenvectors of E- and C-matrices, and can also make the relationship between these matrices vague. However, the results clarify the significance of the principal eigenvectors of E- and C-matrices and the contact number vector in protein structures. Here, it is worth noting that the primary eigenvector of C-matrix corresponds to the lower frequency normal modes of protein structures.
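
The counting argument in this discussion, that the number of possible C-matrices grows far faster with chain length than the number of self-avoiding chain conformations, can be made concrete with a rough calculation. The sketch below is only an illustration under stated assumptions: the number of chain conformations is approximated as `MU**N`, and the value `MU = 4.68` (roughly the connective constant of self-avoiding walks on a simple cubic lattice) is chosen here for illustration and does not come from this paper.

```python
import math

# Assumed for illustration: approximate connective constant of
# self-avoiding walks on the simple cubic lattice.
MU = 4.68


def log2_contact_matrices(n):
    """log2 of the number of symmetric binary matrices with zero diagonal: N(N-1)/2."""
    return n * (n - 1) / 2


def log2_self_avoiding_chains(n, mu=MU):
    """log2 of a mu**N estimate for the number of self-avoiding chain conformations."""
    return n * math.log2(mu)


# Compare the two growth rates for a few chain lengths.
for n in (50, 100, 200):
    print(n, log2_contact_matrices(n), round(log2_self_avoiding_chains(n), 1))
```

Even at 50 residues the exponent for contact matrices is about ten times that for chain conformations, so only a tiny fraction of C-matrices can correspond to realizable conformations.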



The condition for the lowest bound of energy, Eq. (10), indicates that $\varepsilon_0$ in real proteins corresponds to a threshold of contact energy for a unit pair to tend to be in contact in the native structures. In principle, such a threshold for contact energy depends on the size of the protein and the protein architecture; it should be noted that many types of interactions in real proteins are missed in representing interactions by contact potentials. The estimate of $\varepsilon_0$ shown in Fig. 2 is an estimate only for the present specific type of contact potential. The important things
are that the total contact energy is bounded by Eq. pi 
with a constant term, and that the spectral relationships 
of Eq. (18) and Eq. (21) between E- and C-matrices are 



expected for the conformations of the lower bounds if E- 
matrix is decomposed with a constant term as shown in 
Eq. p|. 



Besides that, the spectral representation of C- and E-matrices reveals that pairwise residue-residue interactions, which depend only on the types of interacting amino acids but not on other residues in a protein, are insufficient, and that other interactions, including residue connectivities and steric hindrance, are needed to make native structures the unique lowest energy conformations.